ARIMAX/SARIMAX/VAR Models

Author

Xiaojing Ni

The goal of this project is to explore the relationship among water quality (i.e. sediment, turbidity) and quantity variables (i.e. stream discharge) and climate variables such as precipitation and temperature.

For SARIMAX model, the dependent variables in the SARIMAX model is turbidity, which can be affected by stream discharge (Pejman et al. (2009),Mukundan et al. (2013)) and climate variables (Wang (2014), Roosmalen, Christensen, and Sonnenborg (2007)).

For VAR model, the chosen variables are stream discharge, precipitation, and sediment. Due to the water cycle effect, stream discharge and precipitation have a relationship on both sides. Sediment is usually considered flow with stream flow and caused by high precipitation (Foster and Meyer (1977)). However, due to the water cycle effect and time series factor, there are researches indicate that high turbidity and sediment is an indication of soil erosion (De Vente et al. (2013)) and can cause later higher stream discharge. Thus, for VAR model, the three variables are used to figure out which process is more significant.

Data preparation

In this session, all the data including stream discharge, turbidity, sediment, precipitation, and temperature are processed. The steps include basic data cleaning, such as filling in missing values, trimming all the data to same time frame, and converting to time series format.

Please see the folded code for processing details.

Turbidity (2021-11-24 - 2023-01-28)

Final product is a time series from 2021-11-24 to 2022-12-31 with first 6 elements shown below.

  • Code
    ## read data
    turb <- read.csv("./data/turbidity.csv", header = F, comment.char = "#")
    turb <- turb[3:nrow(turb),1:9]
    colnames(turb) <- c("prefix", "station","date","max","flag1","min","flag2","mean","flag3")
    turb$date<-as.Date(turb$date,"%m/%d/%y")
    
    ## extract daily max
    turb_max<- turb[c("date","max")]
    turb_max$date<-as.Date(turb_max$date,"%m/%d/%y")
    turb_max$max = as.numeric(turb_max$max) 
    ## interpolate the missing data using moving average
    library(imputeTS)
    turb_max$max <- na_ma(turb_max$max, k = 4, weighting = "exponential")
    
    # min(turb_max$date);max(turb_max$date)
    
    ## extract to 12/31/2022 as the climate max date
    turb_max = turb_max[turb_max$date <= "2022-12-31", ]
    # head(turb_max)
    # min(turb_max$date);max(turb_max$date)
    ## sanity check
    # str(turb_max)
    # sum(is.na(turb_max))
    
    turb_df <- data.frame(turb_max$date,turb_max$max)
    
    # convert to time series
    turb_ts <- read.zoo(turb_df)
    head(turb_ts)
    2021-11-24 2021-11-25 2021-11-26 2021-11-27 2021-11-28 2021-11-29 
           6.4        6.3        5.9        8.8        5.6        6.0 
  • Sediment (1950-04-12 - 2003-09-29)

    Final product is a time series from 1993-9-29 to 2003-09-29 with first 6 elements shown below.

  • Code
    ## import and processing data
    sed <- read.csv("./data/sediment.csv", header = F, comment.char = "#")
    sed <- sed[3:nrow(sed),3:4]
    colnames(sed) <- c("date","sed")
    sed$date<-as.Date(sed$date,"%m/%d/%y")
    sed$date <- as.Date(ifelse(sed$date > Sys.Date(), 
      format(sed$date, "19%y-%m-%d"), 
      format(sed$date)))
    
    sed$sed <- as.numeric(sed$sed)
    
    ## interpolate the missing data using moving average
    library(imputeTS)
    sed$sed <- na_ma(sed$sed, k = 4, weighting = "exponential")
    # min(sed$date);max(sed$date)
    
    ## extract from 1983-9-29 to 2003-09-29, 20-years of data
    sed = sed[sed$date >= "1983-9-29", ]
    
    sed_df <- data.frame(sed$date,sed$sed)
    
    # convert to time series
    sed_ts <- read.zoo(sed_df)
    head(sed_ts)
    1983-09-29 1983-09-30 1983-10-01 1983-10-02 1983-10-03 1983-10-04 
             9         10         18         20         24         22 
  • Stream discharge (1940-01-01 - 2023-02-14)

    • For turbidity model, 2021-11-24 to 2022-12-31, the first 6 elements are shown below.

      Code
      ## import and processing data
      discharge <- read.csv("./data/discharge.csv", header = F, comment.char = "#")
      discharge <- discharge[3:dim(discharge)[1],]
      colnames(discharge) <- c("prefix", "station","date","discharge","flag")
      discharge <- discharge[c("date","discharge")]
      discharge$date<-as.Date(discharge$date,"%m/%d/%y")
      discharge$date <- as.Date(ifelse(discharge$date > Sys.Date(), 
        format(discharge$date, "19%y-%m-%d"), 
        format(discharge$date)))
      
      # min(discharge$date);max(discharge$date)
      
      ### discharge for turbidity model, 2021-11-24 to 2022-12-31
      
      # select data after 2021-11-24 and before 2022-12-31
      discharge_turb <- discharge[(discharge$date >= "2021-11-24") & (discharge$date <= "2022-12-31"), ]
      
      # change to numeric
      discharge_turb$discharge <- as.numeric(discharge_turb$discharge)
      
      ## interpolate the missing data using moving average
      library(imputeTS)
      discharge_turb$discharge <- na_ma(discharge_turb$discharge, k = 4, weighting = "exponential")
      
      ## sanity check
      # str(discharge_turb)
      # sum(is.na(discharge_turb))
      
      discharge_turb_df <- data.frame(discharge_turb$date,discharge_turb$discharge)
      
      # convert to time series
      discharge_turb_ts <- read.zoo(discharge_turb_df)
      head(discharge_turb_ts)
      2021-11-24 2021-11-25 2021-11-26 2021-11-27 2021-11-28 2021-11-29 
            2470       2480       2280       2260       2510       2250 
    • For sediment model, from 1983-9-29 to 2003-09-29 , the first 6 elements are shown below.

      Code
      ### discharge for sediment model, 1983-9-29 to 2003-09-29
      
      # select data after 1983-9-29  and before 2003-09-29
      discharge_sed <- discharge[(discharge$date >= "1983-9-29") & (discharge$date <= "2003-09-29"), ]
      
      # change to numeric
      discharge_sed$discharge <- as.numeric(discharge_sed$discharge)
      
      ## interpolate the missing data using moving average
      library(imputeTS)
      discharge_sed$discharge <- na_ma(discharge_sed$discharge, k = 4, weighting = "exponential")
      
      ## sanity check
      # str(discharge_turb)
      # sum(is.na(discharge_turb))
      
      discharge_sed_df <- data.frame(discharge_sed$date,discharge_sed$discharge)
      
      # convert to time series
      discharge_sed_ts <- read.zoo(discharge_sed_df)
      head(discharge_sed_ts)
      1983-09-29 1983-09-30 1983-10-01 1983-10-02 1983-10-03 1983-10-04 
             190        220        210        210        210        205 

    Precipitation and temperature (1940-01-01 - 2022-12-31)

    • For turbidity model, 2021-11-24 to 2022-12-31, the first 6 elements are shown below.

    • Precipitation

      Code
      ## read data and aggregate
      df <- data.frame(matrix(ncol = 6, nrow = 0))
      colnames(df) <- c("station","name","date","prcp","tmax","tmin")
      for (i in 1:3){
        file_name <- paste0("climate",i,".csv")
        df_curr <- read.csv(paste0("./data/",file_name), header = T)
        colnames(df_curr) <- c("station","name","date","prcp","tmax","tmin")
        df <- rbind(df_curr,df)
      }
      
      ## group by date: each day, we have one average prcp, tmax, and tmin based on all stations in Toledo
      
      climate <- df %>%
        group_by(date) %>%
        summarise_at(vars(prcp,tmax,tmin),  mean, na.rm = TRUE)
      
      
      climate$date<-as.Date(climate$date,"%Y-%m-%d")
      ## sanity check
      # any(duplicated(climate$date))
      # min(climate$date);max(climate$date)
      
      ### climate for turbidity model, 2021-11-24 to 2022-12-31
      
      # select data after 2021-11-24 and before 2022-12-31
      climate_trub <- climate[(climate$date >= "2021-11-24") & (climate$date <= "2022-12-31"), ]
      
      ## interpolate the missing data using moving average
      climate_trub$prcp <- na_ma(climate_trub$prcp, k = 4, weighting = "exponential")
      climate_trub$tmax <- na_ma(climate_trub$tmax, k = 4, weighting = "exponential")
      climate_trub$tmin <- na_ma(climate_trub$tmin, k = 4, weighting = "exponential")
      
      ## sanity check
      #str(climate_trub)
      #sum(is.na(climate_trub))
      #min(climate_trub$date);max(climate_trub$date)
      
      ## prcp 
      prcp_trub_df <- data.frame(climate_trub$date,climate_trub$prcp)
      
      # convert to time series
      prcp_trub_ts <- read.zoo(prcp_trub_df)
      head(prcp_trub_ts)
       2021-11-24  2021-11-25  2021-11-26  2021-11-27  2021-11-28  2021-11-29 
      0.002142857 0.133750000 0.068666667 0.009285714 0.078666667 0.001538462 
    • Maximum temperature

      Code
      ## tmax
      tmax_trub_df <- data.frame(climate_trub$date,climate_trub$tmax)
      
      # convert to time series
      tmax_trub_ts <- read.zoo(tmax_trub_df)
      head(tmax_trub_ts)
      2021-11-24 2021-11-25 2021-11-26 2021-11-27 2021-11-28 2021-11-29 
        45.33333   50.00000   43.66667   34.00000   39.33333   41.66667 
    • Minimum temperature

      Code
      ## tmin
      tmin_trub_df <- data.frame(climate_trub$date,climate_trub$tmin)
      
      # convert to time series
      tmin_trub_ts <- read.zoo(tmin_trub_df)
      head(tmin_trub_ts)
      2021-11-24 2021-11-25 2021-11-26 2021-11-27 2021-11-28 2021-11-29 
        23.33333   33.00000   27.66667   25.00000   29.66667   23.33333 
    • For sediment model, from 1983-9-29 to 2003-09-29 , the first 6 elements are shown below.

    • Precipitation

      Code
      ### climate for sediment model, 1983-9-29 to 2003-09-29
      
      # select data after 1983-9-29 and before 2003-09-29 
      climate_sed <- climate[(climate$date >= "1983-9-29") & (climate$date <= "2003-09-29"), ]
      
      ## interpolate the missing data using moving average
      climate_sed$prcp <- na_ma(climate_sed$prcp, k = 4, weighting = "exponential")
      climate_sed$tmax <- na_ma(climate_sed$tmax, k = 4, weighting = "exponential")
      climate_sed$tmin <- na_ma(climate_sed$tmin, k = 4, weighting = "exponential")
      
      ## sanity check
      #str(climate_sed)
      #sum(is.na(climate_sed))
      #min(climate_sed$date);max(climate_sed$date)
      
      ## prcp 
      prcp_sed_df <- data.frame(climate_sed$date,climate_sed$prcp)
      
      # convert to time series
      prcp_sed_ts <- read.zoo(prcp_sed_df)
      head(prcp_sed_ts)
      1983-09-29 1983-09-30 1983-10-01 1983-10-02 1983-10-03 1983-10-04 
            0.00       0.00       0.00       0.00       0.00       0.01 
    • Maximum temperature

      Code
      ## tmin
      tmax_sed_df <- data.frame(climate_sed$date,climate_sed$tmax)
      
      # convert to time series
      tmax_sed_ts <- read.zoo(tmax_sed_df)
      head(tmax_sed_ts)
      1983-09-29 1983-09-30 1983-10-01 1983-10-02 1983-10-03 1983-10-04 
            77.0       75.5       71.0       81.5       84.0       77.0 
    • Minimum temperature

      Code
      ## tmin
      tmin_sed_df <- data.frame(climate_sed$date,climate_sed$tmin)
      
      # convert to time series
      tmin_sed_ts <- read.zoo(tmin_sed_df)
      head(tmin_sed_ts)
      1983-09-29 1983-09-30 1983-10-01 1983-10-02 1983-10-03 1983-10-04 
            50.5       51.5       50.0       48.0       58.5       63.5 

    SARIMAX model of turbidity

    Plotting the Data

    (tab-df?) shows the data sample for turbidity model. Figure 1 shows the time series plots of the variables.

    Code
    turbidity_model_df <- data.frame(turb_df,discharge_turb_df$discharge_turb.discharge,prcp_trub_df$climate_trub.prcp,tmax_trub_df$climate_trub.tmax,tmin_trub_df$climate_trub.tmin)
    
    colnames(turbidity_model_df)<-c("date","turbidity","discharge","precipitation","tmax","tmin")
    
    knitr::kable(head(turbidity_model_df))
    date turbidity discharge precipitation tmax tmin
    2021-11-24 6.4 2470 0.0021429 45.33333 23.33333
    2021-11-25 6.3 2480 0.1337500 50.00000 33.00000
    2021-11-26 5.9 2280 0.0686667 43.66667 27.66667
    2021-11-27 8.8 2260 0.0092857 34.00000 25.00000
    2021-11-28 5.6 2510 0.0786667 39.33333 29.66667
    2021-11-29 6.0 2250 0.0015385 41.66667 23.33333
    Code
    turbidity_model.ts<-ts(turbidity_model_df,star=decimal_date(as.Date("2021-11-24",format = "%Y-%m-%d")),frequency = 365.25)
    
    autoplot(turbidity_model.ts[,c(2:6)], facets=TRUE) +
      xlab("Date") + ylab("") +
      ggtitle("Variables influencing turbidity")

    Figure 1: Time series of variables influencing sediment

    From SARIMA model session for stream discharge, all stream discharge need log transformation. Log transfor for precipitation would introduce NA, thus, the precipitation would stay the original form.

    Code
    lg.turbidity_model <- turbidity_model.ts[,2:6] #making a copy
    lg.turbidity_model[,1]<-log(turbidity_model_df$turbidity)
    lg.turbidity_model[,2]<-log(turbidity_model_df$discharge)

    After log transformation, the time series of the variables are shown in Figure 2.

    Code
    lg.turbidity_model.ts<-ts(lg.turbidity_model,star=decimal_date(as.Date("2021-11-24",format = "%Y-%m-%d")),frequency = 365.25)
    
    autoplot(lg.turbidity_model.ts, facets=TRUE) +
      xlab("Date") + ylab("") +
      ggtitle("Variables influencing turbidity")

    Figure 2: Time series of variables influencing sediment after log transformation

    Fitting the model using auto.arima()

    The model using auto.arima() is shown below.

    Code
    xreg <- cbind(discharge = lg.turbidity_model.ts[, "discharge"],
                  precipitation = lg.turbidity_model.ts[, "precipitation"],
                  tmax = lg.turbidity_model.ts[, "tmax"],
                  tmin = lg.turbidity_model.ts[, "tmin"])
    
    fit <- auto.arima(lg.turbidity_model.ts[, "turbidity"], xreg = xreg)
    summary(fit)
    Series: lg.turbidity_model.ts[, "turbidity"] 
    Regression with ARIMA(1,0,1) errors 
    
    Coefficients:
             ar1      ma1  discharge  precipitation    tmax    tmin
          0.8948  -0.2850     0.4246         0.0878  0.0006  0.0022
    s.e.  0.0282   0.0587     0.0233         0.0956  0.0037  0.0046
    
    sigma^2 = 0.1824:  log likelihood = -226.54
    AIC=467.08   AICc=467.36   BIC=495.07
    
    Training set error measures:
                         ME      RMSE      MAE       MPE     MAPE      MASE
    Training set 0.01048555 0.4239335 0.269546 -1.380152 8.165318 0.1412965
                        ACF1
    Training set -0.01060533
    Code
    checkresiduals(fit)
    
        Ljung-Box test
    
    data:  Residuals from Regression with ARIMA(1,0,1) errors
    Q* = 59.777, df = 79, p-value = 0.9474
    
    Model df: 2.   Total lags used: 81

    Figure 3: Residual diagonosis for auto.arima() model

    Figure 3 shows that the fitting model is an ARIMA(1,0,1) model. The ACF plot looks alright, all the errors are less than the threshold.

    Fitting the model manually

    1. Linear regression model
    Code
    fit.reg <- lm(turbidity ~ precipitation+tmax+tmin+discharge, data=lg.turbidity_model.ts)
    summary(fit.reg)
    
    Call:
    lm(formula = turbidity ~ precipitation + tmax + tmin + discharge, 
        data = lg.turbidity_model.ts)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -1.66757 -0.46131  0.02679  0.43976  2.11160 
    
    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
    (Intercept)   -1.2352589  0.2432210  -5.079 5.85e-07 ***
    precipitation  0.1325368  0.1759714   0.753    0.452    
    tmax           0.0054724  0.0051649   1.060    0.290    
    tmin           0.0005487  0.0059018   0.093    0.926    
    discharge      0.5599929  0.0252455  22.182  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.6838 on 398 degrees of freedom
    Multiple R-squared:  0.5603,    Adjusted R-squared:  0.5559 
    F-statistic: 126.8 on 4 and 398 DF,  p-value: < 2.2e-16
    1. Residual fit

    Figure 4 shows the ACF and PACF plots for the residuals. The ACF has a fading pattern, which indicates that differencing may be needed.

    Code
    res.fit<-ts(residuals(fit.reg),star=decimal_date(as.Date("2021-11-24",format = "%Y-%m-%d")),frequency = 365.25)
    
    ############## Then look at the residuals ############
    acf(res.fit)
    Pacf(res.fit)

    (a) ACF plot for residuals of linear regression

    (b) PACF plot for residuals of linear regression

    Figure 4: ACF and PACF plots for SARIMAX model

    • First differencing

      Code
      res.fit %>% diff() %>% ggtsdisplay()

      Figure 5: Residual diagonosis after linear regression model and differencing

      Figure 5 shows that p=1,2; q=1.

    • Second differencing with seasonality (annually, Figure 6).

      Code
      res.fit %>% diff() %>% diff(365) %>% ggtsdisplay()

      Figure 6: Residual diagonosis after linear regression model and second differencing with annually seasonality

      1. Finding the model parameters
      Code
      ######################## Check for different combinations ########
      
      #write a funtion
      SARIMA.c=function(p1,p2,q1,q2,d1,d2,P1,P2,Q1,Q2,data){
        
        #K=(p2+1)*(q2+1)*(d2)*(P2+1)*(Q2+1)
        
        temp=c()
        D=1
        s=12
        
        i=1
        temp= data.frame()
        ls=matrix(rep(NA,9*48),nrow=48)
        
        
        for (p in p1:p2)
        {
          for(q in q1:q2)
          {
            for(d in d1:d2){
              
              for(P in P1:P2)
              {
                for(Q in Q1:Q2)
                {
                  if(p+d+q+P+Q+D<=12)
                  {
                    skip_to_next <- FALSE
                    model<- tryCatch({
            Arima(data,order=c(p-1,d,q-1),seasonal=c(P-1,D,Q-1))
          }, error = function(e) { skip_to_next <<- TRUE})
        if(skip_to_next) { next }
                    ls[i,]= c(p-1,d,q-1,P-1,D,Q-1,model$aic,model$bic,model$aicc)
                    i=i+1
                    #print(i)
                    }
                  
                }
                
              }
            }
          }
          
        }
        
        
        temp= as.data.frame(ls)
        names(temp)= c("p","d","q","P","D","Q","AIC","BIC","AICc")
        
        temp
        
      }

      All AIC, BIC, and AICc give the best model as ARIMA(1,2,1,0,1,0)[365] for residual.

      # q=0,1,; Q=1,2 and PACF plot: p=0,1,2; P=1,2, D=1 and d=0,1,2
      output=SARIMA.c(p1=1,p2=3,q1=1,q2=2,d1=1,d2=3,P1=1,P2=3,Q1=1,Q2=3,data=res.fit)
      #output
      
      knitr::kable(output)
      p d q P D Q AIC BIC AICc
      0 1 0 0 1 0 97.25568 98.86660 97.36997
      0 2 0 0 1 0 128.65527 130.23879 128.77292
      0 3 0 0 1 0 173.77761 175.33296 173.89883
      0 1 1 0 1 0 92.99954 96.22138 93.35248
      0 2 1 0 1 0 91.22853 94.39557 91.59217
      0 3 1 0 1 0 127.30644 130.41713 127.68144
      1 1 0 0 1 0 92.06895 95.29078 92.42189
      1 2 0 0 1 0 102.87068 106.03772 103.23432
      1 3 0 0 1 0 132.89719 136.00788 133.27219
      1 1 1 0 1 0 93.60573 98.43849 94.33301
      1 2 1 0 1 0 86.63715 91.38771 87.38715
      1 3 1 0 1 0 102.32171 106.98775 103.09590
      2 1 0 0 1 0 93.87314 98.70589 94.60041
      2 2 0 0 1 0 102.36060 107.11116 103.11060
      2 3 0 0 1 0 126.33729 131.00334 127.11149
      2 1 1 0 1 0 95.58437 102.02804 96.83437
      2 2 1 0 1 0 88.34723 94.68131 89.63755
      2 3 1 0 1 0 101.89712 108.11852 103.23046
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      NA NA NA NA NA NA NA NA NA
      output[which.min(output$AIC),] 
         p d q P D Q      AIC      BIC     AICc
      11 1 2 1 0 1 0 86.63715 91.38771 87.38715
      output[which.min(output$BIC),]
         p d q P D Q      AIC      BIC     AICc
      11 1 2 1 0 1 0 86.63715 91.38771 87.38715
      output[which.min(output$AICc),]
         p d q P D Q      AIC      BIC     AICc
      11 1 2 1 0 1 0 86.63715 91.38771 87.38715

      However, the turbidity data has only over 400 observations. Here. I show best model is shown below with ARIMA(1,2,1,0,1,0) .

      Code
      set.seed(1234)
      
      model_output12 <- capture.output(sarima(res.fit, 1,2,1, 0,1,0,0)) 

      The Ljun-Box statistic and Q-Q plot look ok, but ACF plot could be better. This indicate the analysis need more data to improve the model.

    Cross validation

    n=length(res.fit)
    k=200
     
     #n-k=203
    
    
    err1 = c()
    err2 = c()
    
    for(i in 1:(n-k))
    {
      xtrain <- res.fit[1:(k-1)+i] #observations from 1 to 350
      xtest <- res.fit[k+i] #351th observation as the test set
      
      fit <- Arima(xtrain, order=c(1,0,1),
                    include.drift=FALSE, method="ML")
      fcast1 <- forecast(fit, h=4)
      
      fit2 <- Arima(xtrain, order=c(1,2,1), seasonal=c(0,1,0),
                    include.drift=FALSE, method="ML")
      fcast2 <- forecast(fit2, h=4)
      
      #capture error for each iteration
      # This is mean absolute error
      err1 = c(err1, abs(fcast1$mean-xtest)) 
      err2 = c(err2, abs(fcast2$mean-xtest))
      
      # This is mean squared error
      err3 = c(err1, (fcast1$mean-xtest)^2)
      err4 = c(err2, (fcast2$mean-xtest)^2)
      
    }
    
    
    (MAE1=mean(err1)) # This is mean absolute error
    [1] 0.2824119
    (MAE2=mean(err2)) 
    [1] 0.304012
    RMSE1=sqrt(mean(err1^2)) #fit1,0,1,
    RMSE2=sqrt(mean(err2^2))#fit1,2,1,0,1,0
    
    RMSE1
    [1] 0.4392394
    RMSE2
    [1] 0.4838416

    Both MAE and RMSE suggest the manually fitted model, ARIMA(1,2,1,0,1,0) is better than the auto.arima() model, ARIMA(1,0,1).

    Best model

    fit <- Arima(lg.turbidity_model.ts[, "turbidity"], order=c(1,2,1),seasonal = c(0,1,0), xreg = xreg)
    summary(fit)
    Series: lg.turbidity_model.ts[, "turbidity"] 
    Regression with ARIMA(1,2,1)(0,1,0)[365] errors 
    
    Coefficients:
              ar1      ma1  discharge  precipitation     tmax     tmin
          -0.4437  -0.9995     0.5288         0.4683  -0.0003  -0.0088
    s.e.   0.1576   0.0387     0.2065         0.3360   0.0124   0.0160
    
    sigma^2 = 0.6045:  log likelihood = -39.19
    AIC=92.38   AICc=96.38   BIC=103.46
    
    Training set error measures:
                           ME      RMSE        MAE        MPE     MAPE       MASE
    Training set -0.001783241 0.2121265 0.04570337 -0.3000952 1.606162 0.02395778
                        ACF1
    Training set 0.008323962

    \[ (1 + 0.4437B^1 + B^2) * (1 - B^{365}) * discharge(t) = 0.5288 * discharge(t) + 0.4683 * precipitation(t) - 0.0003 * tmax(t) - 0.0088 * tmin(t) - 0.9995\omega(t-1) * \omega(t) \]

    Forcasting

    1. Turbidity
    Code
    discharge_fit<-auto.arima(lg.turbidity_model[,"discharge"]) #fiting an ARIMA model to the Export variable
    summary(discharge_fit) 
    Series: lg.turbidity_model[, "discharge"] 
    ARIMA(1,1,3) 
    
    Coefficients:
             ar1      ma1      ma2      ma3
          0.7870  -0.4365  -0.3785  -0.1139
    s.e.  0.0576   0.0730   0.0527   0.0600
    
    sigma^2 = 0.115:  log likelihood = -134.08
    AIC=278.16   AICc=278.31   BIC=298.15
    
    Training set error measures:
                          ME      RMSE      MAE        MPE     MAPE      MASE
    Training set -0.01051251 0.3370327 0.237908 -0.3106121 3.198979 0.1109779
                         ACF1
    Training set 3.796219e-05
    Code
    fdisc<-forecast(discharge_fit,20)
    1. Precipitation
    Code
    precipitation_fit<-auto.arima(lg.turbidity_model[,"precipitation"]) #fiting an ARIMA model to the Export variable
    summary(precipitation_fit) 
    Series: lg.turbidity_model[, "precipitation"] 
    ARIMA(0,0,1) with non-zero mean 
    
    Coefficients:
             ma1    mean
          0.1524  0.0924
    s.e.  0.0510  0.0115
    
    sigma^2 = 0.04005:  log likelihood = 77.5
    AIC=-149   AICc=-148.94   BIC=-137.01
    
    Training set error measures:
                           ME      RMSE       MAE  MPE MAPE      MASE        ACF1
    Training set 4.109928e-05 0.1996324 0.1246119 -Inf  Inf 0.7434769 -0.00235585
    Code
    fprcp<-forecast(precipitation_fit,20)
    1. Tmax
    Code
    tmax_fit<-auto.arima(lg.turbidity_model[,"tmax"]) #fiting an ARIMA model to the Export variable
    summary(tmax_fit) 
    Series: lg.turbidity_model[, "tmax"] 
    ARIMA(2,1,2) 
    
    Coefficients:
             ar1      ar2      ma1      ma2
          0.6389  -0.2293  -0.5212  -0.2855
    s.e.  0.1075   0.0963   0.1069   0.1019
    
    sigma^2 = 37.76:  log likelihood = -1298.71
    AIC=2607.42   AICc=2607.57   BIC=2627.4
    
    Training set error measures:
                         ME     RMSE     MAE       MPE     MAPE      MASE
    Training set -0.0507898 6.106834 4.42749 -3.117993 11.34176 0.3541992
                         ACF1
    Training set 0.0003491482
    Code
    ftmax<-forecast(tmax_fit,20)
    1. Tmin
    Code
    tmin_fit<-auto.arima(lg.turbidity_model[,"tmin"]) #fiting an ARIMA model to the Export variable
    summary(tmin_fit) 
    Series: lg.turbidity_model[, "tmin"] 
    ARIMA(2,1,2) 
    
    Coefficients:
             ar1      ar2      ma1      ma2
          0.5618  -0.1251  -0.5784  -0.2393
    s.e.  0.1661   0.1353   0.1648   0.1541
    
    sigma^2 = 32.33:  log likelihood = -1267.38
    AIC=2544.76   AICc=2544.91   BIC=2564.74
    
    Training set error measures:
                          ME     RMSE      MAE      MPE     MAPE      MASE
    Training set 0.005936972 5.650462 4.257408 5.316034 27.88954 0.4737379
                         ACF1
    Training set -0.001209705
    Code
    ftmin<-forecast(tmin_fit,20)

    4 Forecast

    Code
    fxreg <- cbind(discharge = fdisc$mean,
                  precipitation = fprcp$mean,
                  tmax = ftmax$mean,
                  tmin = ftmin$mean)
    
    fturb <- forecast(fit, xreg=fxreg) 
    autoplot(fturb) + xlab("Year") +
      ylab("Turbidity")

    Conclusion

    The manually fitted model shows that the turbidity will be decrease in the near future based on stream discharge, precipitation, tmax, tmin. This agrees with the stream discharge annual pattern: lower in fall and winter, higher in spring due to snow melting and higher precipitation. However, the annual pattern may not be captured well, because that the sample size is small. Thus, this model need to be verified with larger dataset in the future.

    VAR model

    The variables include precipitation, stream discharge, precipitation. Because I want to explore the longer lag relationship among them, I will use monthly data here.

    Code
    sed_model_df <- data.frame(sed_df,discharge_sed$discharge, prcp_sed_ts)
    
    colnames(sed_model_df)<-c("date","sediment","discharge","precipitation")
    
    
    ### can get monthly data
    # Get mean value for each month
    
    # take log of the discharge
    sed_model_df["discharge"] <- log(sed_model_df["discharge"])
    
    mean_data <- sed_model_df %>% 
      mutate(month = month(sed_model_df$date), year = year(sed_model_df$date)) %>% 
      group_by(year, month) %>% 
      summarise_at(vars(c("sediment","discharge","precipitation")), list(mean = mean))
    
    
    var_month<-ts(mean_data[3:5],star=decimal_date(as.Date("1983-09-29",format = "%Y-%m-%d")),frequency = 12)
    
    
    
    
    
    autoplot(var_month[,c(1:3)], facets=TRUE) +
      xlab("Time") + ylab("") +
      ggtitle("Monthly variables")

    Code
    pairs(var_month)

    Code
    ###### select p #########        
    VARselect(var_month, lag.max=12, type="both") ## largest annualy
    $selection
    AIC(n)  HQ(n)  SC(n) FPE(n) 
         8      3      1      8 
    
    $criteria
                   1         2         3         4         5         6         7
    AIC(n) 0.7481091 0.7152077 0.5944947 0.5450676 0.5391289 0.4443612 0.4271302
    HQ(n)  0.8388459 0.8603866 0.7941157 0.7991307 0.8476341 0.8073084 0.8445196
    SC(n)  0.9730254 1.0750738 1.0893105 1.1748332 1.3038443 1.3440263 1.4617452
    FPE(n) 2.1130447 2.0447859 1.8125173 1.7255152 1.7159209 1.5615824 1.5359804
                   8         9       10        11        12
    AIC(n) 0.4008526 0.4433351 0.472206 0.4535587 0.4633184
    HQ(n)  0.8726840 0.9696087 1.052922 1.0887164 1.1529182
    SC(n)  1.5704173 1.7478496 1.911670 2.0279727 2.1726822
    FPE(n) 1.4975083 1.5642969 1.612407 1.5853433 1.6041771
    Code
    #summary(VAR(var_month, p=1, type='both')) # 'both' fits constant + trend

    The results show p=1, 3 and 8 are with good SC, HQ, or AIC/FPE.

    VAR(1)

    Code
    summary(vars::VAR(var_month, p=1, type='both')) # 'both' fits constant + trend
    
    VAR Estimation Results:
    ========================= 
    Endogenous variables: sediment_mean, discharge_mean, precipitation_mean 
    Deterministic variables: both 
    Sample size: 240 
    Log Likelihood: -1105.261 
    Roots of the characteristic polynomial:
    0.642 0.2904 0.1438
    Call:
    vars::VAR(y = var_month, p = 1, type = "both")
    
    
    Estimation results for equation sediment_mean: 
    ============================================== 
    sediment_mean = sediment_mean.l1 + discharge_mean.l1 + precipitation_mean.l1 + const + trend 
    
                            Estimate Std. Error t value Pr(>|t|)   
    sediment_mean.l1         0.21918    0.08914   2.459   0.0147 * 
    discharge_mean.l1       15.81294    4.85883   3.254   0.0013 **
    precipitation_mean.l1 -184.89650   85.00729  -2.175   0.0306 * 
    const                  -61.10324   35.63190  -1.715   0.0877 . 
    trend                    0.10024    0.05492   1.825   0.0692 . 
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    
    Residual standard error: 57.1 on 235 degrees of freedom
    Multiple R-Squared: 0.1888, Adjusted R-squared: 0.175 
    F-statistic: 13.68 on 4 and 235 DF,  p-value: 4.852e-10 
    
    
    Estimation results for equation discharge_mean: 
    =============================================== 
    discharge_mean = sediment_mean.l1 + discharge_mean.l1 + precipitation_mean.l1 + const + trend 
    
                            Estimate Std. Error t value Pr(>|t|)    
    sediment_mean.l1      -0.0034489  0.0013087  -2.635  0.00896 ** 
    discharge_mean.l1      0.7729023  0.0713339  10.835  < 2e-16 ***
    precipitation_mean.l1 -2.0939191  1.2480161  -1.678  0.09472 .  
    const                  2.1936310  0.5231221   4.193  3.9e-05 ***
    trend                  0.0001387  0.0008063   0.172  0.86354    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    
    Residual standard error: 0.8383 on 235 degrees of freedom
    Multiple R-Squared: 0.4004, Adjusted R-squared: 0.3902 
    F-statistic: 39.24 on 4 and 235 DF,  p-value: < 2.2e-16 
    
    
    Estimation results for equation precipitation_mean: 
    =================================================== 
    precipitation_mean = sediment_mean.l1 + discharge_mean.l1 + precipitation_mean.l1 + const + trend 
    
                            Estimate Std. Error t value Pr(>|t|)   
    sediment_mean.l1       8.486e-05  7.686e-05   1.104   0.2707   
    discharge_mean.l1     -1.225e-03  4.190e-03  -0.292   0.7703   
    precipitation_mean.l1  8.411e-02  7.330e-02   1.148   0.2523   
    const                  9.280e-02  3.072e-02   3.021   0.0028 **
    trend                 -3.091e-05  4.736e-05  -0.653   0.5145   
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    
    Residual standard error: 0.04923 on 235 degrees of freedom
    Multiple R-Squared: 0.02277,    Adjusted R-squared: 0.006138 
    F-statistic: 1.369 on 4 and 235 DF,  p-value: 0.2454 
    
    
    
    Covariance matrix of residuals:
                       sediment_mean discharge_mean precipitation_mean
    sediment_mean           3260.098       33.46433           1.344236
    discharge_mean            33.464        0.70268           0.017749
    precipitation_mean         1.344        0.01775           0.002424
    
    Correlation matrix of residuals:
                       sediment_mean discharge_mean precipitation_mean
    sediment_mean             1.0000         0.6992             0.4782
    discharge_mean            0.6992         1.0000             0.4301
    precipitation_mean        0.4782         0.4301             1.0000

    VAR(3)

    Code
    summary(vars::VAR(var_month, p=3, type='both')) 
    
    VAR Estimation Results:
    ========================= 
    Endogenous variables: sediment_mean, discharge_mean, precipitation_mean 
    Deterministic variables: both 
    Sample size: 238 
    Log Likelihood: -1058.944 
    Roots of the characteristic polynomial:
    0.7776 0.7776 0.725 0.591 0.4909 0.4909 0.4119 0.4119 0.272
    Call:
    vars::VAR(y = var_month, p = 3, type = "both")
    
    
    Estimation results for equation sediment_mean: 
    ============================================== 
    sediment_mean = sediment_mean.l1 + discharge_mean.l1 + precipitation_mean.l1 + sediment_mean.l2 + discharge_mean.l2 + precipitation_mean.l2 + sediment_mean.l3 + discharge_mean.l3 + precipitation_mean.l3 + const + trend 
    
                            Estimate Std. Error t value Pr(>|t|)   
    sediment_mean.l1         0.09740    0.09174   1.062  0.28952   
    discharge_mean.l1       15.23421    5.99443   2.541  0.01171 * 
    precipitation_mean.l1 -124.09096   84.52566  -1.468  0.14347   
    sediment_mean.l2         0.16936    0.09667   1.752  0.08112 . 
    discharge_mean.l2        8.59342    7.28602   1.179  0.23946   
    precipitation_mean.l2 -243.88524   85.65124  -2.847  0.00481 **
    sediment_mean.l3         0.21419    0.09470   2.262  0.02466 * 
    discharge_mean.l3      -12.73707    6.13465  -2.076  0.03900 * 
    precipitation_mean.l3 -246.08273   84.46078  -2.914  0.00393 **
    const                   -0.11040   42.65437  -0.003  0.99794   
    trend                    0.06540    0.05453   1.199  0.23160   
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    
    Residual standard error: 54.01 on 227 degrees of freedom
    Multiple R-Squared: 0.2885, Adjusted R-squared: 0.2572 
    F-statistic: 9.206 on 10 and 227 DF,  p-value: 9.451e-13 
    
    
    Estimation results for equation discharge_mean: 
    =============================================== 
    discharge_mean = sediment_mean.l1 + discharge_mean.l1 + precipitation_mean.l1 + sediment_mean.l2 + discharge_mean.l2 + precipitation_mean.l2 + sediment_mean.l3 + discharge_mean.l3 + precipitation_mean.l3 + const + trend 
    
                            Estimate Std. Error t value Pr(>|t|)    
    sediment_mean.l1      -4.097e-03  1.360e-03  -3.012 0.002893 ** 
    discharge_mean.l1      7.648e-01  8.889e-02   8.604 1.28e-15 ***
    precipitation_mean.l1 -1.069e+00  1.253e+00  -0.853 0.394548    
    sediment_mean.l2       1.430e-03  1.433e-03   0.997 0.319616    
    discharge_mean.l2      2.069e-01  1.080e-01   1.915 0.056755 .  
    precipitation_mean.l2 -4.415e+00  1.270e+00  -3.476 0.000609 ***
    sediment_mean.l3       7.368e-04  1.404e-03   0.525 0.600335    
    discharge_mean.l3     -2.912e-01  9.097e-02  -3.201 0.001565 ** 
    precipitation_mean.l3 -2.050e-01  1.252e+00  -0.164 0.870161    
    const                  3.141e+00  6.325e-01   4.965 1.35e-06 ***
    trend                 -1.043e-05  8.086e-04  -0.013 0.989718    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    
    Residual standard error: 0.8009 on 227 degrees of freedom
    Multiple R-Squared: 0.4676, Adjusted R-squared: 0.4441 
    F-statistic: 19.94 on 10 and 227 DF,  p-value: < 2.2e-16 
    
    
    Estimation results for equation precipitation_mean: 
    =================================================== 
    precipitation_mean = sediment_mean.l1 + discharge_mean.l1 + precipitation_mean.l1 + sediment_mean.l2 + discharge_mean.l2 + precipitation_mean.l2 + sediment_mean.l3 + discharge_mean.l3 + precipitation_mean.l3 + const + trend 
    
                            Estimate Std. Error t value Pr(>|t|)
    sediment_mean.l1       8.239e-05  8.281e-05   0.995    0.321
    discharge_mean.l1     -7.274e-03  5.411e-03  -1.344    0.180
    precipitation_mean.l1  1.145e-01  7.630e-02   1.501    0.135
    sediment_mean.l2      -5.021e-05  8.726e-05  -0.575    0.566
    discharge_mean.l2      8.018e-03  6.577e-03   1.219    0.224
    precipitation_mean.l2 -2.788e-02  7.731e-02  -0.361    0.719
    sediment_mean.l3       1.935e-05  8.548e-05   0.226    0.821
    discharge_mean.l3      5.376e-03  5.537e-03   0.971    0.333
    precipitation_mean.l3 -9.352e-02  7.624e-02  -1.227    0.221
    const                  4.405e-02  3.850e-02   1.144    0.254
    trend                 -1.272e-05  4.922e-05  -0.259    0.796
    
    
    Residual standard error: 0.04875 on 227 degrees of freedom
    Multiple R-Squared: 0.05668,    Adjusted R-squared: 0.01513 
    F-statistic: 1.364 on 10 and 227 DF,  p-value: 0.1982 
    
    
    
    Covariance matrix of residuals:
                       sediment_mean discharge_mean precipitation_mean
    sediment_mean           2916.663       30.02468           1.271526
    discharge_mean            30.025        0.64136           0.018059
    precipitation_mean         1.272        0.01806           0.002376
    
    Correlation matrix of residuals:
                       sediment_mean discharge_mean precipitation_mean
    sediment_mean             1.0000         0.6942             0.4830
    discharge_mean            0.6942         1.0000             0.4626
    precipitation_mean        0.4830         0.4626             1.0000

    VAR(8)

    Code
    summary(vars::VAR(var_month, p=8, type='both')) 
    
    VAR Estimation Results:
    ========================= 
    Endogenous variables: sediment_mean, discharge_mean, precipitation_mean 
    Deterministic variables: both 
    Sample size: 233 
    Log Likelihood: -956.712 
    Roots of the characteristic polynomial:
    0.979 0.979 0.8923 0.8331 0.8331 0.7842 0.7541 0.7541 0.7476 0.7476 0.7376 0.7376 0.7238 0.7109 0.7109 0.6854 0.6854 0.6632 0.6632 0.5744 0.5744 0.3774 0.3774 0.2044
    Call:
    vars::VAR(y = var_month, p = 8, type = "both")
    
    
    Estimation results for equation sediment_mean: 
    ============================================== 
    sediment_mean = sediment_mean.l1 + discharge_mean.l1 + precipitation_mean.l1 + sediment_mean.l2 + discharge_mean.l2 + precipitation_mean.l2 + sediment_mean.l3 + discharge_mean.l3 + precipitation_mean.l3 + sediment_mean.l4 + discharge_mean.l4 + precipitation_mean.l4 + sediment_mean.l5 + discharge_mean.l5 + precipitation_mean.l5 + sediment_mean.l6 + discharge_mean.l6 + precipitation_mean.l6 + sediment_mean.l7 + discharge_mean.l7 + precipitation_mean.l7 + sediment_mean.l8 + discharge_mean.l8 + precipitation_mean.l8 + const + trend 
    
                            Estimate Std. Error t value Pr(>|t|)   
    sediment_mean.l1       4.184e-02  9.738e-02   0.430  0.66792   
    discharge_mean.l1      1.009e+01  7.470e+00   1.350  0.17837   
    precipitation_mean.l1 -2.040e+01  9.835e+01  -0.207  0.83592   
    sediment_mean.l2       1.477e-01  1.012e-01   1.460  0.14582   
    discharge_mean.l2      6.393e+00  7.970e+00   0.802  0.42342   
    precipitation_mean.l2 -1.228e+02  9.645e+01  -1.273  0.20445   
    sediment_mean.l3       1.086e-01  1.020e-01   1.065  0.28826   
    discharge_mean.l3     -3.570e-02  7.898e+00  -0.005  0.99640   
    precipitation_mean.l3 -2.420e+02  9.181e+01  -2.636  0.00902 **
    sediment_mean.l4       1.238e-01  1.012e-01   1.223  0.22276   
    discharge_mean.l4     -4.606e+00  7.621e+00  -0.604  0.54621   
    precipitation_mean.l4 -9.518e+01  9.199e+01  -1.035  0.30206   
    sediment_mean.l5       2.048e-01  1.006e-01   2.035  0.04308 * 
    discharge_mean.l5     -9.153e+00  7.511e+00  -1.219  0.22439   
    precipitation_mean.l5 -5.019e+01  9.248e+01  -0.543  0.58792   
    sediment_mean.l6       6.143e-03  1.005e-01   0.061  0.95130   
    discharge_mean.l6     -4.396e+00  7.712e+00  -0.570  0.56933   
    precipitation_mean.l6  1.003e+02  9.256e+01   1.083  0.28002   
    sediment_mean.l7      -7.048e-03  9.992e-02  -0.071  0.94383   
    discharge_mean.l7     -2.602e+00  7.752e+00  -0.336  0.73752   
    precipitation_mean.l7  7.137e+01  1.007e+02   0.709  0.47937   
    sediment_mean.l8       2.015e-02  9.765e-02   0.206  0.83675   
    discharge_mean.l8      2.057e+00  6.766e+00   0.304  0.76140   
    precipitation_mean.l8  1.041e+02  9.944e+01   1.047  0.29615   
    const                  5.646e+01  7.650e+01   0.738  0.46134   
    trend                  7.107e-02  5.775e-02   1.231  0.21984   
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    
    Residual standard error: 52.91 on 207 degrees of freedom
    Multiple R-Squared: 0.3283, Adjusted R-squared: 0.2471 
    F-statistic: 4.046 on 25 and 207 DF,  p-value: 9.672e-09 
    
    
    Estimation results for equation discharge_mean: 
    =============================================== 
    discharge_mean = sediment_mean.l1 + discharge_mean.l1 + precipitation_mean.l1 + sediment_mean.l2 + discharge_mean.l2 + precipitation_mean.l2 + sediment_mean.l3 + discharge_mean.l3 + precipitation_mean.l3 + sediment_mean.l4 + discharge_mean.l4 + precipitation_mean.l4 + sediment_mean.l5 + discharge_mean.l5 + precipitation_mean.l5 + sediment_mean.l6 + discharge_mean.l6 + precipitation_mean.l6 + sediment_mean.l7 + discharge_mean.l7 + precipitation_mean.l7 + sediment_mean.l8 + discharge_mean.l8 + precipitation_mean.l8 + const + trend 
    
                            Estimate Std. Error t value Pr(>|t|)    
    sediment_mean.l1      -2.995e-03  1.381e-03  -2.169 0.031243 *  
    discharge_mean.l1      4.720e-01  1.059e-01   4.456 1.37e-05 ***
    precipitation_mean.l1  2.394e+00  1.395e+00   1.717 0.087553 .  
    sediment_mean.l2       1.122e-03  1.435e-03   0.782 0.435001    
    discharge_mean.l2      2.122e-01  1.130e-01   1.877 0.061908 .  
    precipitation_mean.l2 -1.670e+00  1.368e+00  -1.221 0.223352    
    sediment_mean.l3      -1.096e-05  1.446e-03  -0.008 0.993960    
    discharge_mean.l3     -1.233e-02  1.120e-01  -0.110 0.912478    
    precipitation_mean.l3 -1.153e+00  1.302e+00  -0.886 0.376746    
    sediment_mean.l4       5.921e-04  1.435e-03   0.413 0.680377    
    discharge_mean.l4     -4.996e-02  1.081e-01  -0.462 0.644335    
    precipitation_mean.l4  4.998e-01  1.305e+00   0.383 0.702030    
    sediment_mean.l5       2.281e-03  1.427e-03   1.598 0.111497    
    discharge_mean.l5     -2.240e-01  1.065e-01  -2.103 0.036651 *  
    precipitation_mean.l5  4.568e-01  1.311e+00   0.348 0.727974    
    sediment_mean.l6      -3.587e-04  1.425e-03  -0.252 0.801439    
    discharge_mean.l6     -1.939e-01  1.094e-01  -1.773 0.077762 .  
    precipitation_mean.l6  4.613e+00  1.313e+00   3.514 0.000542 ***
    sediment_mean.l7      -1.337e-04  1.417e-03  -0.094 0.924922    
    discharge_mean.l7     -7.190e-02  1.099e-01  -0.654 0.513830    
    precipitation_mean.l7  3.049e+00  1.428e+00   2.134 0.033992 *  
    sediment_mean.l8      -1.271e-04  1.385e-03  -0.092 0.926959    
    discharge_mean.l8      1.260e-01  9.594e-02   1.313 0.190688    
    precipitation_mean.l8  2.004e+00  1.410e+00   1.421 0.156716    
    const                  4.751e+00  1.085e+00   4.380 1.89e-05 ***
    trend                  1.367e-04  8.189e-04   0.167 0.867569    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    
    Residual standard error: 0.7503 on 207 degrees of freedom
    Multiple R-Squared: 0.5602, Adjusted R-squared: 0.507 
    F-statistic: 10.55 on 25 and 207 DF,  p-value: < 2.2e-16 
    
    
    Estimation results for equation precipitation_mean: 
    =================================================== 
    precipitation_mean = sediment_mean.l1 + discharge_mean.l1 + precipitation_mean.l1 + sediment_mean.l2 + discharge_mean.l2 + precipitation_mean.l2 + sediment_mean.l3 + discharge_mean.l3 + precipitation_mean.l3 + sediment_mean.l4 + discharge_mean.l4 + precipitation_mean.l4 + sediment_mean.l5 + discharge_mean.l5 + precipitation_mean.l5 + sediment_mean.l6 + discharge_mean.l6 + precipitation_mean.l6 + sediment_mean.l7 + discharge_mean.l7 + precipitation_mean.l7 + sediment_mean.l8 + discharge_mean.l8 + precipitation_mean.l8 + const + trend 
    
                            Estimate Std. Error t value Pr(>|t|)  
    sediment_mean.l1       3.528e-05  8.940e-05   0.395    0.694  
    discharge_mean.l1      3.381e-03  6.858e-03   0.493    0.623  
    precipitation_mean.l1 -1.801e-02  9.029e-02  -0.199    0.842  
    sediment_mean.l2       4.444e-08  9.287e-05   0.000    1.000  
    discharge_mean.l2      4.697e-03  7.317e-03   0.642    0.522  
    precipitation_mean.l2 -9.101e-02  8.855e-02  -1.028    0.305  
    sediment_mean.l3       7.905e-06  9.363e-05   0.084    0.933  
    discharge_mean.l3     -1.600e-03  7.251e-03  -0.221    0.826  
    precipitation_mean.l3 -3.523e-02  8.428e-02  -0.418    0.676  
    sediment_mean.l4       2.425e-05  9.292e-05   0.261    0.794  
    discharge_mean.l4      4.559e-04  6.996e-03   0.065    0.948  
    precipitation_mean.l4 -9.918e-02  8.446e-02  -1.174    0.242  
    sediment_mean.l5      -7.392e-06  9.238e-05  -0.080    0.936  
    discharge_mean.l5      6.098e-03  6.896e-03   0.884    0.378  
    precipitation_mean.l5 -6.192e-02  8.490e-02  -0.729    0.467  
    sediment_mean.l6      -8.211e-05  9.222e-05  -0.890    0.374  
    discharge_mean.l6      9.482e-03  7.080e-03   1.339    0.182  
    precipitation_mean.l6 -1.814e-01  8.498e-02  -2.134    0.034 *
    sediment_mean.l7      -2.095e-06  9.174e-05  -0.023    0.982  
    discharge_mean.l7     -1.961e-03  7.117e-03  -0.276    0.783  
    precipitation_mean.l7 -1.092e-01  9.247e-02  -1.180    0.239  
    sediment_mean.l8      -3.165e-05  8.965e-05  -0.353    0.724  
    discharge_mean.l8      1.574e-03  6.211e-03   0.253    0.800  
    precipitation_mean.l8 -5.069e-02  9.129e-02  -0.555    0.579  
    const                 -1.248e-02  7.023e-02  -0.178    0.859  
    trend                 -1.588e-05  5.301e-05  -0.299    0.765  
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    
    Residual standard error: 0.04858 on 207 degrees of freedom
    Multiple R-Squared: 0.1268, Adjusted R-squared: 0.02139 
    F-statistic: 1.203 on 25 and 207 DF,  p-value: 0.2395 
    
    
    
    Covariance matrix of residuals:
                       sediment_mean discharge_mean precipitation_mean
    sediment_mean           2799.546       28.28649            1.38484
    discharge_mean            28.286        0.56299            0.02334
    precipitation_mean         1.385        0.02334            0.00236
    
    Correlation matrix of residuals:
                       sediment_mean discharge_mean precipitation_mean
    sediment_mean             1.0000         0.7125             0.5388
    discharge_mean            0.7125         1.0000             0.6403
    precipitation_mean        0.5388         0.6403             1.0000

    For VAR(1), VAR(3) and VAR(8) models, when using sediment as dependent variable, both discharge and precipitation are significant. Discharge also related to sediment adn precipitation significantly. However, precipitation is not affected by either sediment or discharge. I have also tried using 10 years data and get VAR(7) model with good AIC. On the contrary, VAR(7) model shows precipitation are correlated with lag 7 of discharge and sediment. This may imply the periodicity of water cycle. And also the length of data various may give different model.

    Cross validation

    Code
    var_month1 <- var_month[2:241,]
    n=nrow(var_month)
    k=76 #19*4
    
    #n-k=164; 164/12=13.67;
    
    rmse1 <- matrix(NA, 12*13,3)
    rmse2 <- matrix(NA, 12*13,3)
    rmse3 <- matrix(NA, 12*13,3)
    
    year<-c()
    
    # Convert data frame to time series object
    ts_obj <- ts(var_month1[, c(1:3)], star=decimal_date(as.Date("1983-09-29",format = "%Y-%m-%d")),frequency = 12)
    
    st <- tsp(ts_obj )[1]+(k-1)/12
    
    
    for(i in 1:13) # i is a year
    {
      
      xtrain <- window(ts_obj, end=st + i-1)
      xtest <- window(ts_obj, start=st + (i-1) + 1/12, end=st + i)
      
      ## p=1
      fit <- vars::VAR(ts_obj, p=1, type='both')
      fcast <- predict(fit, n.ahead = 12)
      
      fsed<-fcast$fcst$sediment_mean
      fprcp<-fcast$fcst$precipitation_mean
      fdisc<-fcast$fcst$discharge_mean
      ff<-data.frame(fsed[,1],fprcp[,1],fdisc[,1])
      
      year<-st + (i-1) + 1/12
      
      ff<-ts(ff,start=c(year,1),frequency = 12)
      
      a = (i-1)*12+1
      b= a+11
      rmse1[c(a:b),] <- sqrt((ff-xtest)^2)
      
      
      ## p=3
      fit2 <- vars::VAR(ts_obj, p=3, type='both')
      fcast2 <- predict(fit2, n.ahead = 12)
      
      fsed<-fcast2$fcst$sediment_mean
      fprcp<-fcast2$fcst$precipitation_mean
      fdisc<-fcast2$fcst$discharge_mean
      ff2<-data.frame(fsed[,1],fprcp[,1],fdisc[,1])
      
      year<-st + (i-1) + 1/12
      
      ff2<-ts(ff2,start=c(year,1),frequency = 12)
      
      rmse2[c(a:b),] <-sqrt((ff2-xtest)^2)
      
      ## p=8
      fit3 <- vars::VAR(ts_obj, p=8, type='both')
      fcast3 <- predict(fit3, n.ahead = 12)
      
      fsed<-fcast3$fcst$sediment_mean
      fprcp<-fcast3$fcst$precipitation_mean
      fdisc<-fcast3$fcst$discharge_mean
      ff3<-data.frame(fsed[,1],fprcp[,1],fdisc[,1])
      
      year<-st + (i-1) + 1/12
      
      ff3<-ts(ff3,start=c(year,1),frequency = 12)
      
      rmse3[c(a:b),] <-sqrt((ff3-xtest)^2)
    }
     
    yr = rep(c(1983:1995),each =12)
    mn = rep(c("September", "October", "November", "December","January","February","March","April","May","June","July", "August"),13)
    
    rmse1 = data.frame(yr,mn,rmse1)
    names(rmse1) =c("Year", "Month","Sediment","Discharge","Precipitation")
    rmse2 = data.frame(yr,mn,rmse2)
    names(rmse2) =c("Year", "Month","Sediment","Discharge","Precipitation")
    rmse3 = data.frame(yr,mn,rmse3)
    names(rmse3) =c("Year", "Month","Sediment","Discharge","Precipitation")
    
    ### sediment
    ggplot() + 
      geom_line(data = rmse1, aes(x = Year, y = Sediment,colour="VAR(1)")) +
      geom_line(data = rmse2, aes(x = Year, y = Sediment,colour="VAR(3)")) +
      geom_line(data = rmse3, aes(x = Year, y = Sediment,colour="VAR(8)")) +
      labs(
        title = "CV RMSE for Sediment",
        x = "Date",
        y = "RMSE") +
      scale_color_manual(name = "VAR Models", values = c("VAR(1)" = "blue", "VAR(3)" = "red","VAR(8)" = "black"))

    Figure 7: Cross validation RMSE for sediment

    Code
    ## discharge
    ggplot() + 
      geom_line(data = rmse1, aes(x = Year, y = Discharge,colour="VAR(1)")) +
      geom_line(data = rmse2, aes(x = Year, y = Discharge,colour="VAR(3)")) +
      geom_line(data = rmse3, aes(x = Year, y = Discharge,colour="VAR(8)")) +
      labs(
        title = "CV RMSE for Discharge",
        x = "Date",
        y = "RMSE") +
      scale_color_manual(name = "VAR Models", values = c("VAR(1)" = "blue", "VAR(3)" = "red","VAR(8)" = "black"))

    Figure 8: Cross validation RMSE for stream discharge

    Code
    ## discharge
    ggplot() + 
      geom_line(data = rmse1, aes(x = Year, y = Precipitation,colour="VAR(1)")) +
      geom_line(data = rmse2, aes(x = Year, y = Precipitation,colour="VAR(3)")) +
      geom_line(data = rmse3, aes(x = Year, y = Precipitation,colour="VAR(8)")) +
      labs(
        title = "CV RMSE for Precipitation",
        x = "Date",
        y = "RMSE") +
      scale_color_manual(name = "VAR Models", values = c("VAR(1)" = "blue", "VAR(3)" = "red","VAR(8)" = "black"))

    Figure 9: Cross validation RMSE for precipitation

    Code
    var3 <- vars::VAR(ts_obj, p=3, type='both')
    acf(residuals(var3))
    serial.test(var3, lags.pt=10, type="PT.asymptotic")
    
        Portmanteau Test (asymptotic)
    
    data:  Residuals of VAR object var3
    Chi-squared = 76.221, df = 63, p-value = 0.1225

    Figure 10: VAR(3) ACF

    From Figure 7 ,Figure 8, and Figure 9, we can see the value of p has little impact on stream discharge. For precipitation, VAR(8) has lower RMSE compared to VAR(1) and VAR(3). For sediment, the results varied by time. VAR(1) perform better with lower RMSE around 1982-1985m 1990, 1992, 1995, while VAR(1) and VAR(8) both have lower RMSE in some years. The residuals for VAR(3) model pass the test for serial correlation, while others failed. ACF plots for Figure 10 looks ok. Thus, I choose VAR(3) for the overall model.

    Forecast

    Code
    forecast(var3) %>%
      autoplot() + xlab("Year")

    Figure 11: VAR(3) forecast

    Conclusion

    The VAR models show that the long term correlation caused by water cycle is not significant or cannot be detected. The precipitation is not affected by discharge or sediment. The discharge is not affected by sediment. The strong relationships are still one way. Thus, SARIMAX model may be more suitble for these variables with sediment as dependent variables, and precipitation and discharge plus temperature as the independent variables.

    References

    De Vente, Joris, Jean Poesen, Gert Verstraeten, Gerard Govers, Matthias Vanmaercke, Anton Van Rompaey, Mahmood Arabkhedri, and Carolina Boix-Fayos. 2013. “Predicting Soil Erosion and Sediment Yield at Regional Scales: Where Do We Stand?” Earth-Science Reviews 127: 16–29.
    Foster, GR, and LD Meyer. 1977. “Soil Erosion and Sedimentation by Water–an Overview.” In ASAE Publication No. 4-77. Proceedings of the National Symposium on Soil Erosion and Sediment by Water, Chicago, Illinois, December 12-13, 1977.
    Mukundan, Rajith, DC Pierson, EM Schneiderman, DM O’donnell, SM Pradhanang, MS Zion, and AH Matonse. 2013. “Factors Affecting Storm Event Turbidity in a New York City Water Supply Stream.” Catena 107: 80–88.
    Pejman, AH, GR Nabi Bidhendi, AR Karbassi, N Mehrdadi, and M Esmaeili Bidhendi. 2009. “Evaluation of Spatial and Seasonal Variations in Surface Water Quality Using Multivariate Statistical Techniques.” International Journal of Environmental Science & Technology 6: 467–76.
    Roosmalen, Lieke van, Britt SB Christensen, and Torben O Sonnenborg. 2007. “Regional Differences in Climate Change Impacts on Groundwater and Stream Discharge in Denmark.” Vadose Zone Journal 6 (3): 554–71.
    Wang, Xixi. 2014. “Advances in Separating Effects of Climate Variability and Human Activity on Stream Discharge: An Overview.” Advances in Water Resources 71: 209–18.